Methods and apparatus for improving the breathing of disk scheduling algorithms

ABSTRACT

A method for breathing of scheduling algorithms for a storage device (110). The method includes: (a) computing a worst-case duration of a breathing cycle (P) for the storage device (110); (b) starting a breathing cycle; (c) determining if one of the following becomes true before the end of P: (i) a number of real-time requests is at least a predetermined threshold based on a number of data streams and performance parameters of the storage device; and (ii) a number of pending requests for any single stream becomes more than one; (d) if at least one of (i) and (ii) remain true during the duration of P, starting a subsequent breathing cycle after completion of the breathing cycle; and (e) if both of (i) and (ii) are not true during the duration of P, waiting P time units from the start of the breathing cycle before starting the subsequent breathing cycle.

The present invention relates generally to breathing of disk scheduling algorithms, and more particularly, to methods for improving the breathing by delaying the handling of real-time requests, provided certain conditions are not satisfied, to alleviate problems associated with prior art breathing of disk scheduling algorithms.

The current cycle-based disk scheduling algorithms serve disk requests batch by batch, in such a way that for each stream at most one disk access is being handled per batch. Given that a maximum bit rate is known for each of the streams and given an accurate model of the guaranteed throughput of the disk, the duration of a worst-case cycle is determined, and based on that cycle, the buffer sizes and the sizes of the blocks that are repeatedly transferred between the disk and the buffers are determined. In addition, bandwidth for so-called best-effort requests can be reserved by reserving a given fraction of the worst-case cycle for best-effort requests.

The disk algorithms are called breathing, in the sense that a new batch is handled immediately when the previous batch is completed. Given that worst-case assumptions on both the bit rate of the streams and the disk performance parameters are incorporated in determining the worst-case cycle, the buffer sizes, and the block sizes, the number of real-time requests per batch will usually be substantially smaller than the number of streams. If the current load on the disk is low, then the batches will be small, and the response time on starting up a new stream or issuing a large best-effort request will be small. Whenever the load on the disk increases, it will automatically start handling larger batches, and in doing so, spend less time on switching. So, whenever required, the disk will automatically start working more efficiently. An additional advantage of this breathing behavior is that the algorithm is less sensitive to rogue streams, where a rogue stream is a stream that temporarily or continuously requires a higher bit rate than assumed. Whenever the algorithm has available bandwidth, it will automatically serve those streams more frequently than was accounted for in the worst-case analysis, without endangering the performance that was guaranteed to the other streams. So all bandwidth that becomes available because of a better than worst-case performance of the disk and because of some other streams requiring a bit rate that is smaller than the assumed maximum is not wasted but can be spent on rogue streams.

However, these breathing algorithms are associated with some disadvantages. Immediately starting a new batch whenever the previous one is completed has the following disadvantages:

1. An application that issues best-effort requests only one at a time, which is usual in normal PC applications, will be able to issue only one best-effort request per cycle. If this application wants to issue many small requests, but blocks whenever one request is issued, then the resulting average bit rate can be quite low.

2. Generally, the average bit rate required by the real-time streams will be considerably smaller than the worst-case requirement. As a result, the number of real-time requests per cycle will be quite small. The effect is that a relatively large fraction of the time is spent on seeking. This has an adverse effect on the energy consumption, on the lifetime of the disk, and on the noise it produces when in operation.

Therefore it is an object of the present invention to provide an improvement in the breathing of disk scheduling algorithms over that which is currently utilized in the art. Accordingly, a method for breathing of scheduling algorithms for a storage device is provided. The method comprising: (a) computing a worst-case duration of a breathing cycle for the storage device, the worst-case duration being referred to by P; (b) starting a breathing cycle; (c) determining if one of the following becomes true before the end of P time units: (i) a number of real-time requests is at least a predetermined threshold based on a number of data streams and performance parameters of the storage device; and (ii) a number of pending requests for any single stream becomes more than one; (d) if at least one of (i) and (ii) remain true during the duration of P time units from the start of the breathing cycle, starting a subsequent breathing cycle after completion of the breathing cycle; and (e) if both of (i) and (ii) are not true during the duration of P time units from the start of the breathing cycle, waiting P time units from the start of the breathing cycle before starting the subsequent breathing cycle.

Preferably, best-effort requests arriving during the breathing cycle are handled during the breathing cycle.

The method preferably further comprises repeating steps (b)-(e) for a plurality of breathing cycles.

The method preferably further comprises (f) calculating an actual bit rate of a data stream based on the determination of step (ii), in which case the method further comprises (g) changing a bit rate for the data stream based on the calculating in step (f). Preferably, step (g) comprises reserving a higher bit rate for the data stream where its estimated maximum bit rate is exceeded; where the data stream is transferred between the storage device and a buffer, the method further comprises (h) increasing the buffer size. Alternatively, step (g) comprises reserving a lower bit rate for the data stream where its estimated maximum bit rate is not exceeded; where the data stream is transferred between the storage device and a buffer, the method further comprises (h) decreasing the buffer size.

Also provided is a storage device scheduler for controlling the breathing of scheduling algorithms for a storage device. The storage device scheduler comprising: (a) means for computing a worst-case duration of a breathing cycle for the storage device, the worst-case duration being referred to by P; (b) means for instructing the starting of a breathing cycle; (c) means for determining if one of the following becomes true before the end of P time units: (i) a number of real-time requests is at least a predetermined threshold based on a number of data streams and performance parameters of the storage device; and (ii) a number of pending requests for any single stream becomes more than one; (d) means for starting a subsequent breathing cycle after completion of the breathing cycle if at least one of (i) and (ii) remain true during the duration of P time units from the start of the breathing cycle; and (e) means for waiting P time units from the start of the breathing cycle before starting the subsequent breathing cycle if both of (i) and (ii) are not true during the duration of P time units from the start of the breathing cycle.

The storage device scheduler preferably further comprises repeating steps (b)-(e) for a plurality of breathing cycles.

The storage device scheduler preferably further comprises (f) calculating an actual bit rate of a data stream based on the determination of step (ii), in which case the storage device scheduler preferably further comprises (g) changing a bit rate for the data stream based on the calculating in step (f). Preferably, step (g) comprises reserving a higher bit rate for the data stream where its estimated maximum bit rate is exceeded; where the data stream is transferred between the storage device and a buffer, the storage device scheduler further comprises (h) increasing the buffer size. Alternatively, step (g) comprises reserving a lower bit rate for the data stream where its estimated maximum bit rate is not exceeded; where the data stream is transferred between the storage device and a buffer, the storage device scheduler further comprises (h) decreasing the buffer size.

Also provided are a computer program product for carrying out the methods of the present invention and a program storage device for the storage of the computer program product therein.

These and other features, aspects, and advantages of the apparatus and methods of the present invention will become better understood with regard to the following description, appended claims, and accompanying drawing where the Figure illustrates a schematic view of a system architecture for use with the methods of the present invention.

Referring now to the Figure, there is illustrated a schematic representation of a system architecture, generally referred to by reference numeral 100. Those skilled in the art will appreciate that the system architecture in the Figure is given by way of example only and not to limit the scope or spirit of the present invention. The methods of the present invention can be utilized in any system that employs disk-scheduling algorithms for retrieving data from a storage device, such as a hard disk drive.

FIG. 1 schematically shows a block diagram of a system 100 for use with the methods of the present invention. Examples of such systems 100 are hard disk recorders, set-top boxes, television sets, audio jukeboxes, and video and multimedia servers. Multimedia applications can be characterized by an extensive use of audio-visual material. The methods of the invention are applicable to video-on-demand (VOD) servers as well as home servers, personal video recorders, portable multimedia devices, and any device that uses a disk or other storage device to play out audio or video.

For the playback of audio or video a (near-) continuous supply of audio/video data is required. Known examples of a multimedia server include a near-video-on-demand server and a video-on-demand server. In a near-video-on-demand system the service provider determines when a title is reproduced. A data stream containing the data of the title may be received by many users simultaneously. In a video-on-demand system, typically the user selects a title and controls, with VCR-like controls, the reproduction of the title. The level of interaction is higher and a data stream is typically only consumed by one user. A multimedia server is usually implemented using a file server that is specifically designed to supply continuous data streams for a large number of users in parallel.

Usually, one or more multimedia titles are stored on a background storage medium 110. Typically, disks, such as hard disks, are used as the background storage medium 110, based on their large storage capacity at low cost and the possibility of random access. It will be appreciated that other storage media, such as optical disks, tape, or even solid-state memory, may also be used. The storage medium 110 is preferably a single storage device, but may be divided into a plurality of storage units, in which case data can be striped across the multiple disks in such a way that a request to read or write a block of data results in a disk access on each disk.

The system 100 comprises a reader 180 for reading data from the storage medium 110. The reader 180 may be implemented using a SCSI or IDE interface. Advantageously, the storage medium 110 is also included in the system 100. For a disk oriented storage medium 110, data is retrieved in units of a Disk Access Block (DAB), where a DAB is formed by a sequence of consecutive sectors. The reader 180 may also include a caching memory for temporarily storing data read from the disk before supplying the data, potentially in a different sequence than read from disk, via a bus 140 to the remainder of the system 100. Particularly for video, a data stream may be very voluminous. To reduce the volume, typically, compression techniques are used. The compression scheme may result in a fixed rate data stream, for instance using a fixed rate form of MPEG-1 encoding, or a variable rate data stream, for instance using a variable rate form of MPEG-2 encoding. The system may be used for fixed rate systems as well as variable rate systems. Normally, the data is stored in the storage medium 110 and processed by the system 100 in a compressed form. The data stream is decompressed, using a decoder, only at the user 130. Particularly for a variable rate system, the system 100 may also be able to support VCR-like control functions. The system 100 maintains for the data stream a stream status that indicates the current state. The stream status for one or more data streams may be stored in a status memory 190, such as the main memory (RAM) of the server or special registers. Data is read from the storage medium 110 for a batch of data streams where the data of the batch is supplied as a time multiplexed stream via the bus 140. The storage medium 110 is not capable of simultaneously supplying continuous data streams to all users of the system. Instead, data for a subset of data streams is read and supplied to the remainder of the system 100 at a higher data rate than consumed by the corresponding data streams. The system 100, therefore, comprises buffers 125 for achieving supply of data at the required rate to the users 130. Usually, the buffers 125 are implemented using RAM in a part 120 of the system's memory. The system 100 further comprises communication means 150 for transferring data of the data streams to users. The communication means 150 may be formed by any suitable means, such as a local area network, for supplying the data to users located near the system 100. In practice, a telecommunication or cable network is used for supplying the data over a longer distance.

The system 100 also comprises a control unit 160 for controlling the system 100. A main part of the control unit is formed by the scheduler 170, which determines which DABs should be read by the reader 180 from the storage medium 110 in order to avoid an underflow or overflow of the buffers 125. The control unit is typically formed by a processor, such as a RISC- or CISC-type microprocessor, which is operated under control of a real-time operating system, loaded from a storage medium, such as ROM or a hard disk. The scheduler 170 may be implemented as a software module integrated into the operating system or loaded as an application program. Typically, the scheduler 170 receives status information, such as a filling degree of the buffers, upon which the scheduler 170 bases its decision. For systems that offer VCR-like controls, the scheduler also receives information regarding the status of a stream. In such systems, typically, control information is received from the users 130 via the communication means 150.

Where the storage medium 110 is a hard disk, data is stored in concentric circles, called tracks, on the disk. Each track consists of an integer number of sectors. Tracks near the outer edge of a disk may contain more sectors than tracks near the inner edge. For this purpose, modern disks arrange the set of tracks in non-overlapping zones, where the tracks in a zone have the same number of sectors and different zones correspond to different numbers of sectors per track. Typically, a disk rotates at a constant angular velocity, so that reading from tracks in a zone near the outer edge results in a higher data transfer rate than reading from tracks in a zone near the inner edge. The time required for accessing data from the disk is mainly determined by: a seek time, i.e., the time needed to move the reading head to the desired track, a rotational latency, i.e., the time that passes before the required data moves under the reading head once the track has been reached, and a read time, i.e., the time needed to actually read the data. The sum of the seek time and the rotational latency is referred to as switch time. The read time depends on the amount of data to be read and the radial position of the track(s) on which the data is stored. The rotational latency per access takes at most one revolution of the disk. The seek time per access is maximal if the reading head has to be moved from the inner edge to the outer edge of the disk, or vice versa. To avoid having to take such a maximum seek into account for each access, disk accesses are handled in batches, each called a sweep. As the head moves from the inner edge to the outer edge, or vice versa, the required data blocks are read in the order in which they are encountered on disk.
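By way of illustration only, the access-time components described above can be sketched as follows. The class `DiskModel`, the function names, and all numeric values are hypothetical and chosen for readability; they are not parameters of any particular disk. The sketch bounds a single isolated access by the switch time plus the read time at the slowest zone, and orders the requests of one batch as a sweep.

```python
from dataclasses import dataclass


@dataclass
class DiskModel:
    """Hypothetical disk parameters (illustrative values only, not measured)."""
    max_seek_s: float        # worst-case seek: inner edge to outer edge
    rotation_s: float        # duration of one full revolution
    min_rate_bytes_s: float  # transfer rate of the slowest (innermost) zone


def worst_case_access_time(disk: DiskModel, block_bytes: int) -> float:
    """Upper bound for one isolated access: seek + rotational latency + read."""
    # Switch time is the sum of seek time and rotational latency;
    # the rotational latency is at most one revolution.
    switch_s = disk.max_seek_s + disk.rotation_s
    # The read time is bounded using the slowest zone.
    read_s = block_bytes / disk.min_rate_bytes_s
    return switch_s + read_s


def sweep_order(track_numbers: list[int], ascending: bool = True) -> list[int]:
    """Order the requests of one batch so the head crosses the disk once."""
    return sorted(track_numbers, reverse=not ascending)


if __name__ == "__main__":
    disk = DiskModel(max_seek_s=0.020, rotation_s=0.008, min_rate_bytes_s=20e6)
    print(worst_case_access_time(disk, 1_000_000))  # roughly 0.078 s
    print(sweep_order([40, 5, 17]))
```

Handling a batch in sweep order means that the full inner-to-outer seek is paid at most once per batch rather than once per access, which is why the batch-wise bound is tighter than the sum of the per-access bounds.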

The methods of the present invention will now be described with reference to the Figure. Focusing on a triple buffering algorithm, as is known in the art, such as that presented by Korst et al., "Disk scheduling for variable-rate data streams", Proc. European Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services, IDMS'97, Darmstadt, September 10-12, Lecture Notes in Computer Science 1309 (1997), pp. 119-132, the breathing is preferably adjusted as follows. However, those skilled in the art will appreciate that other algorithms now known or later developed can also be utilized with the methods of the present invention. Let the worst-case duration of a cycle be given by P, wherein a cycle is defined as the time interval that is used to retrieve typically (at most) one DAB for each data stream, possibly reserving additional time for handling best-effort requests. The calculation of the worst-case duration of a cycle is well known in the art. Whenever a cycle is completed, the next cycle can be started immediately thereafter, as is currently implemented in the art. In the methods of the present invention, however, the start of the next cycle can be delayed until P time units after the start of the completed cycle. Provided that no stream exceeds its maximum bit rate, it can still be guaranteed that buffer underflow and buffer overflow will not occur in the latter case. Both approaches can be considered as two extremes of a range of possible strategies to determine when the next cycle is to be started.
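Purely as an illustrative sketch, one simplified way of arriving at a worst-case cycle duration P is shown below. It rests on assumptions that are not taken from this description: each stream reads one block per cycle sized to cover its maximum consumption during P, the total switching overhead of a sweep is bounded by a single figure `switch_overhead_s`, the disk guarantees a transfer rate `disk_rate_bps`, and a fraction `best_effort_fraction` of the cycle is set aside for best-effort requests. Under these assumptions P satisfies (1 - f) P = s + P (sum of r_i) / r_disk, which is solved for P. The function names are hypothetical, and the actual calculation used with a given algorithm may differ.

```python
def worst_case_cycle(max_rates_bps, switch_overhead_s, disk_rate_bps,
                     best_effort_fraction=0.0):
    """Solve (1 - f) * P = s + P * sum(r_i) / r_disk for P (simplified model)."""
    load = sum(max_rates_bps) / disk_rate_bps       # fraction of P spent reading
    slack = 1.0 - best_effort_fraction - load       # fraction left for switching
    if slack <= 0:
        raise ValueError("streams cannot be guaranteed under these assumptions")
    return switch_overhead_s / slack


def block_sizes_bits(max_rates_bps, period_s):
    """Block per stream: the data consumed at maximum rate during one cycle."""
    return [rate * period_s for rate in max_rates_bps]


if __name__ == "__main__":
    # Illustrative numbers only: five streams of 4 Mbit/s, 100 Mbit/s disk.
    rates = [4e6] * 5
    P = worst_case_cycle(rates, switch_overhead_s=0.2, disk_rate_bps=100e6,
                         best_effort_fraction=0.1)
    print(P, block_sizes_bits(rates, P))
```

The sketch makes visible why a larger reserved best-effort fraction, or a higher total stream rate, lengthens P: both shrink the share of the cycle that remains to absorb the switching overhead.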

This alternative strategy that better balances the advantages and disadvantages of breathing will now be described in more detail. The strategy of the methods of the present invention is to wait until P time units have passed since the start of the previous cycle to start the next cycle, unless one of the two following conditions becomes true before this time. If one of these conditions becomes true, then the next cycle is started immediately. As long as the next cycle is not started, the disk need not be idle but best-effort requests can be handled in successive batches. Best-effort requests should be handled in such a way that the next cycle that includes real-time requests is not started later than P time units after the start of the current cycle. Furthermore, after the completion of each batch of best-effort requests, the conditions should be checked.

The conditions to immediately start the next cycle are:

1. The number of (real-time) requests is at least n_threshold, where n_threshold is a suitably chosen number that depends on the number of streams n, and on the disk performance parameters. If n ≥ 5, then n_threshold could be chosen equal to 2 or 3. Three requests per cycle results in a reasonably efficient use of sweeps.

2. The number of pending requests for a single stream becomes 2. If the stream does not exceed its maximum bit rate, then this condition will never be true. However, if some stream temporarily or continuously exceeds its estimated maximum bit rate, then buffer underflow or overflow should be avoided, whenever possible.

By starting the next cycle earlier than waiting for P time units to pass, a larger bit rate can be offered to such streams, without endangering the guaranteed bit rates for other streams. Since both requests are now handled in the same sweep, the chances for an efficient sweep improve because two requests of a single stream will often be contiguous. Note that when a new stream is started, it will immediately issue two or even three requests. In that case, the above second condition is satisfied and the start-up time of the new stream will correspondingly be small. The above strategy maintains to a large extent the advantages of breathing. The average response times will still be considerably smaller than worst-case, since on average cycles will still be considerably shorter than P time units. Rogue streams can still receive a higher bit rate than their maximum bit rate, provided that there is available bandwidth to do so.
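The decision of when to start the next cycle can be sketched, by way of example only, as follows. The names `CycleState`, `should_start_next_cycle`, and `best_effort_batch_fits` are hypothetical, and the representation of pending requests as a per-stream count is an assumption made for the sketch. The first function returns true once P time units have elapsed since the start of the current cycle, or earlier if either of the two conditions holds; the second checks that a best-effort batch can be completed without pushing the next real-time cycle beyond P.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CycleState:
    elapsed_s: float                       # time since the current cycle started
    pending_per_stream: Mapping[str, int]  # pending real-time requests per stream


def should_start_next_cycle(state: CycleState, period_s: float,
                            n_threshold: int) -> bool:
    """Start when P has elapsed, or earlier if condition 1 or 2 holds."""
    if state.elapsed_s >= period_s:
        return True
    counts = state.pending_per_stream.values()
    # Condition 1: enough real-time requests for a reasonably efficient sweep.
    if sum(counts) >= n_threshold:
        return True
    # Condition 2: some stream has more than one pending request.
    return any(count >= 2 for count in counts)


def best_effort_batch_fits(elapsed_s: float, batch_worst_case_s: float,
                           period_s: float) -> bool:
    """A best-effort batch may run only if it ends within P of the cycle start."""
    return elapsed_s + batch_worst_case_s <= period_s


if __name__ == "__main__":
    state = CycleState(elapsed_s=0.1, pending_per_stream={"a": 1, "b": 0})
    print(should_start_next_cycle(state, period_s=0.6, n_threshold=3))
```

In such a sketch the check would be repeated after each best-effort batch completes, so that a condition that became true while the batch was being handled is acted upon before another batch is started.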

In addition, it alleviates the disadvantages of breathing to a large extent. An application that issues small best-effort requests one at a time will be able to issue multiple requests per cycle. As long as the next cycle is not started, it can repeatedly issue requests. Also, the disk will behave more energy-efficiently, produce less noise, and have a longer lifetime, since it will be idle more often, especially when on average the best-effort load is moderate.

An additional advantage of the above strategy is that checking the second condition can be used to provide feedback on the actual bit rate of a stream. If a stream continuously exceeds its estimated maximum bit rate, then the data blocks that are repeatedly transferred between disk and buffer for that stream are increased, i.e., to reserve a higher bit rate for that stream. The buffer size of this stream would be increased correspondingly. In addition, if a stream continuously requires a lower bit rate than its estimated maximum bit rate, then adjusting its block and buffer sizes could also be considered.
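A minimal sketch of such feedback is given below, under assumptions that go beyond this description: the block of a stream is taken as its reserved rate multiplied by P, the buffer is taken as three blocks (one reading of triple buffering), and a stream is treated as continuously exceeding its rate when the second condition fired in at least a fraction `raise_ratio` of the observed cycles. The names `StreamReservation`, `reserve`, and `adjust_reservation`, as well as the values of `raise_ratio` and `step`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamReservation:
    reserved_rate_bps: float  # bit rate currently reserved for the stream
    block_bits: float         # block transferred between disk and buffer
    buffer_bits: float        # buffer size for the stream


def reserve(rate_bps: float, period_s: float,
            blocks_per_buffer: int = 3) -> StreamReservation:
    """Derive block and buffer sizes from a reserved rate (assumed model)."""
    block = rate_bps * period_s
    return StreamReservation(rate_bps, block, blocks_per_buffer * block)


def adjust_reservation(current: StreamReservation, period_s: float,
                       overrun_cycles: int, observed_cycles: int,
                       raise_ratio: float = 0.5,
                       step: float = 1.25) -> StreamReservation:
    """Raise the reservation when condition 2 fired in many observed cycles."""
    if observed_cycles <= 0:
        return current
    if overrun_cycles / observed_cycles >= raise_ratio:
        # Larger blocks reserve a higher bit rate; the buffer grows with them.
        return reserve(current.reserved_rate_bps * step, period_s)
    return current


if __name__ == "__main__":
    r = reserve(4e6, period_s=0.6)
    print(adjust_reservation(r, 0.6, overrun_cycles=8, observed_cycles=10))
```

Lowering a reservation for a stream that stays below its estimated maximum could be handled symmetrically, but it is omitted here because a reduction must not be applied before the buffer has drained to the smaller size.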

EXAMPLE

Applying a disk-scheduling method of the present invention to a prototype system, the following improved results were observed. In addition to twelve real-time streams, a large batch of best-effort requests was issued. With 12 DVB video streams, using a worst-case cycle length of 600 ms to access an IBM Deskstar 60 GXP 40 GB disk, the following results were obtained:

If the total bit rate of the video streams is 48 Mbit/s, then an increase of the bit rate for best-effort data from 8.8 Mbit/s to 16.8 Mbit/s was observed when using a preferred implementation of the methods of the present invention instead of the approach in which a new cycle is always started immediately upon completion of the previous one. If the total bit rate of the video streams is 56 Mbit/s, then an increase of the bit rate for best-effort data from 0.8 Mbit/s to 9.6 Mbit/s was observed. Hence, the gain greatly depends on the real-time load on the disk. This is confirmed by other experiments. The best-effort requests consisted of 10 blocks of 1 MByte for each experiment. In the above example, P = 600 ms and n = 12 (i.e., twelve real-time streams).

These preliminary experiments indicate at least an increase of the best-effort rate by a factor of three to six. In addition, trick play that is (partly) handled as best effort operated with substantially fewer problems.

The methods of the present invention are particularly suited to be carried out by a computer software program, such computer software program preferably containing modules corresponding to the individual steps of the methods. Such software can of course be embodied in a computer-readable medium, such as an integrated chip or a peripheral device.

While there has been shown and described what is considered to be preferred embodiments of the invention, it will, of course, be understood that various modifications and changes in form or detail could readily be made without departing from the spirit of the invention. It is therefore intended that the invention be not limited to the exact forms described and illustrated, but should be construed to cover all modifications that may fall within the scope of the appended claims.

1. A method for breathing of scheduling algorithms for a storage device, the method comprising: (a) computing a worst-case duration of a breathing cycle for the storage device, the worst-case duration being referred to by P; (b) starting a breathing cycle; (c) determining if one of the following becomes true before the end of P time units: (i) a number of real-time requests is at least a predetermined threshold based on a number of data streams and performance parameters of the storage device; and (ii) a number of pending requests for any single stream becomes more than one; (d) if at least one of (i) and (ii) remain true during the duration of P time units from the start of the breathing cycle, starting a subsequent breathing cycle after completion of the breathing cycle; and (e) if both of (i) and (ii) are not true during the duration of P time units from the start of the breathing cycle, waiting P time units from the start of the breathing cycle before starting the subsequent breathing cycle.
2. The method of claim 1, wherein best-effort requests arriving during the breathing cycle are handled during the breathing cycle.
3. The method of claim 1, further comprising, repeating steps (b)-(e) for a plurality of breathing cycles.
4. The method of claim 1, further comprising (f) calculating an actual bit-rate of a data stream based on the determination of step (ii).
5. The method of claim 4, further comprising (g) changing a bit rate for the data stream based on the calculating in step (f).
6. The method of claim 5, wherein step (g) comprises reserving a higher bit rate for the data stream where its estimated maximum bit rate is exceeded.
7. The method of claim 6, wherein the data stream is transferred between the storage device and a buffer, the method further comprises (h) increasing a size of the buffer.
8. The method of claim 5, wherein step (g) comprises reserving a lower bit rate for the data stream where its estimated maximum bit rate is not exceeded.
9. The method of claim 8, wherein the data stream is transferred between the storage device and a buffer, the method further comprises (h) decreasing a size of the buffer.
10. A storage device scheduler for controlling the breathing of scheduling algorithms for a storage device, the storage device scheduler comprising: (a) means for computing a worst-case duration of a breathing cycle for the storage device, the worst-case duration being referred to by P; (b) means for instructing the starting of a breathing cycle; (c) means for determining if one of the following becomes true before the end of P time units: (i) a number of real-time requests is at least a predetermined threshold based on a number of data streams and performance parameters of the storage device; and (ii) a number of pending requests for any single stream becomes more than one; (d) means for starting a subsequent breathing cycle after completion of the breathing cycle if at least one of (i) and (ii) remain true during the duration of P time units from the start of the breathing cycle; and (e) means for waiting P time units from the start of the breathing cycle before starting the subsequent breathing cycle if both of (i) and (ii) are not true during the duration of P time units from the start of the breathing cycle.
11. The storage device scheduler of claim 10, further comprising, repeating steps (b)-(e) for a plurality of breathing cycles.
12. The storage device scheduler of claim 10, further comprising (f) calculating an actual bit-rate of a data stream based on the determination of step (ii).
13. The storage device scheduler of claim 12, further comprising (g) changing a bit rate for the data stream based on the calculating in step (f).
14. The storage device scheduler of claim 13, wherein step (g) comprises reserving a higher bit rate for the data stream where its estimated maximum bit rate is exceeded.
15. The storage device scheduler of claim 14, wherein the data stream is transferred between the storage device and a buffer, the method further comprises (h) increasing a size of the buffer.
16. The storage device scheduler of claim 13, wherein step (g) comprises reserving a lower bit rate for the data stream where its estimated maximum bit rate is not exceeded.
17. The storage device scheduler of claim 16, wherein the data stream is transferred between the storage device and a buffer, the method further comprises (h) decreasing a size of the buffer.
18. The storage device scheduler of claim 10, further comprising means for handling best-effort requests arriving during the breathing cycle during the breathing cycle.
19. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for breathing of scheduling algorithms for a storage device, the method comprising: (a) computing a worst-case duration of a breathing cycle for the storage device, the worst-case duration being referred to by P; (b) starting a breathing cycle; (c) determining if one of the following becomes true before the end of P time units: (i) a number of real-time requests is at least a predetermined threshold based on a number of data streams and performance parameters of the storage device; and (ii) a number of pending requests for any single stream becomes more than one; (d) if at least one of (i) and (ii) remain true during the duration of P time units from the start of the breathing cycle, starting a subsequent breathing cycle after completion of the breathing cycle; and (e) if both of (i) and (ii) are not true during the duration of P time units from the start of the breathing cycle, waiting P time units from the start of the breathing cycle before starting the subsequent breathing cycle.
20. A computer program product embodied in a computer-readable medium for breathing of scheduling algorithms for a storage device, the method comprising: (a) computer readable program code means for computing a worst-case duration of a breathing cycle for the storage device, the worst-case duration being referred to by P; (b) computer readable program code means for starting a breathing cycle; (c) computer readable program code means for determining if one of the following becomes true before the end of P time units: (i) a number of real-time requests is at least a predetermined threshold based on a number of data streams and performance parameters of the storage device; and (ii) a number of pending requests for any single stream becomes more than one; (d) computer readable program code means for if at least one of (i) and (ii) remain true during the duration of P time units from the start of the breathing cycle, starting a subsequent breathing cycle after completion of the breathing cycle; and (e) computer readable program code means for if both of (i) and (ii) are not true during the duration of P time units from the start of the breathing cycle, waiting P time units from the start of the breathing cycle before starting the subsequent breathing cycle.